ABSTRACT


This  paper  shows  that  some  implementations  of  fault-tolerant 
systems  with  dynamic  error  detection  and  reconfiguration  mechanisms  may 
not  recover  from  certain  types  of  temporary  failures.  An  experiment  is 
conducted  to  study  the  effect  of  temporary  failures  on  the  behavior  of  a 
dynamically  redundant  fault-tolerant  system.  The  system  is  built  out  of 
LSTTL  catalog  parts.  Transient  failures  are  induced  by  reducing  the 
power  supply  voltage;  intermittent  failures  are  induced  by  loading  nodes 
in  the  system.  Reducing  the  power  supply  voltage  produces  common-mode 
failures  that  can  be  detected  if  the  recovery  mechanism  produces  high 
amplitude  oscillations  when  its  inputs  are  near  the  threshold  level. 
Intermittent  failures  can  be  detected  if  the  recovery  mechanism  detects 
errors  before  incorrect  data  is  transmitted  through  the  output  devices. 
It  is  shown  that  the  stuck-at  fault  model  is  inappropriate  for  the 
temporary  failures  injected  into  the  system.  Techniques  are  suggested 
that  will  guarantee  detection  of  many  transient  and  intermittent 
failures.


Keywords:  Fault-tolerant  systems,  Temporary  failures,  Dynamic  recovery 
mechanisms. 
1  INTRODUCTION


Fault-tolerant  schemes  often  require  hardware  replication.  A 
fault-tolerant  system  will  recover  from  a  fault  if  that  fault  does  not 
simultaneously  affect  too  many  of  the  replicated  modules.  Consider  a 
Triple  Modular  Redundant  (TMR)  system  [Siewiorek  82].  It  consists  of 
three  identical  modules  and  a  voter.  The  system  will  operate  correctly 
if  the  voter  and  at  least  two  of  the  three  modules  are  fault-free.  If  a 
fault  simultaneously  affects  two  or  more  of  the  modules  (common-mode 
failure),  the  system  may  produce  erroneous  outputs. 
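The masking behavior described above can be illustrated with a small logic-level sketch. It is not taken from the paper: the function names and the "inverted output" fault are assumptions chosen only to show that a single faulty module is outvoted while a common-mode failure is not.

```python
def majority(a: int, b: int, c: int) -> int:
    """2-out-of-3 majority vote, as performed by a TMR voter."""
    return (a & b) | (a & c) | (b & c)


def tmr_output(inputs, faulty_modules=()):
    """Model three copies of a 2-input OR module feeding a fault-free voter.

    faulty_modules lists the indices (0..2) of modules whose output is
    inverted. This fault is illustrative only (an assumption).
    """
    x, y = inputs
    correct = x | y
    outputs = [
        (correct ^ 1) if i in faulty_modules else correct
        for i in range(3)
    ]
    return majority(*outputs)


# One faulty module is masked; two faulty modules defeat the voter.
single_fault = tmr_output((0, 1), faulty_modules=(0,))
common_mode = tmr_output((0, 1), faulty_modules=(0, 1))
```

Here `single_fault` equals the correct value of the OR function, while `common_mode` does not, which is the situation the text calls a common-mode failure.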

While  the  behavior  of  fault-tolerant  systems  in  the  presence  of 
permanent  failures  is  well  established,  the  effect  of  temporary  failures 
on  these  systems  is  not  well  understood.  Temporary  failures  can  be 
divided  into  transient  and  intermittent  [Cortes  86b]  [McCluskey  86]
[Siewiorek  82].  Transient  failures  are  nonrecurring  temporary  failures 
caused  by  some  externally  induced  signal  perturbation  usually  due  to 
radiation,  power  supply  fluctuation,  etc.  Intermittent  failures  are 
recurring  temporary  failures  caused  by  component  degradation  or  poor 
design  (violation  of  operating  margins).  The  frequency  of  temporary 
failures  is  much  higher  than  that  of  permanent  failures.  Experiments 
show  that  at  least  80%  of  system  failures  are  due  to  temporary  failures
[Iyer  82]  [Siewiorek  82].  Therefore,  it  is  very  important  to  verify 
that  fault-tolerant  systems  can  recover  from  temporary  failures. 

[Cortes 87] gives a good survey of temporary failures. The survey
notes that temporary failures are modeled as random stuck-at faults.
Temporary failures affecting memory cells can be modeled by the
stuck-at fault model: alpha particles can cause a cell to temporarily
store an incorrect value, and the effect of the failure disappears when
the information in the cell is overwritten. On the other hand,
[Cortes 86a] shows that, in some cases, temporary failures cannot be
modeled by the stuck-at fault model. Therefore, it is important to
verify that the stuck-at fault model can accurately explain the behavior
of a fault-tolerant system with a temporary failure.

In this paper, a simple fault-tolerant system is described. The
fault-tolerant technique used is dynamic redundancy [Lala 85]
[Siewiorek 82]. A dynamically redundant system consists of several
identical modules with only one of them operating at a time (the active
module). The other modules serve as spares and replace the active module
in case an error is detected in it. Dynamic redundancy requires
concurrent (on-line) error detection and reconfiguration. The system
under study is built out of LSTTL catalog parts (74LSxx series). It is
then stressed using the methods described in [Cortes 86b] to induce
certain types of transient and intermittent failures. The output of the
system is monitored in order to study the effectiveness of the error
detection and reconfiguration circuitry in the presence of temporary
failures. It is shown that, for this specific implementation of dynamic
redundancy, some temporary failures are either not detected or the
system is unable to reconfigure successfully. Furthermore, it is found
that the stuck-at fault model is inappropriate for temporary failures.
Finally, techniques are suggested that will guarantee detection of many
transient and intermittent failures.


2  SYSTEM  DESCRIPTION 


The function of the system under study is to perform the inclusive
OR of its two inputs. The system uses dynamic redundancy for
fault-tolerance [Losq 75]. Figure 1 shows the fault-tolerant system. It
consists of two identical modules, X and Y, and a recovery mechanism.
Module X is the active module and module Y is the powered spare. Each
module (X or Y) consists of two OR gates. The recovery mechanism
consists of a detector and a switch. The detector is an XOR gate that
compares the outputs of the two OR gates in module X. The switch
consists of a J-K flip-flop and four buffers with tri-state outputs.
Initially, the J-K flip-flop is reset (Q=0 and Q'=1) and the outputs of
the OR gates in module X are connected to the system bus (bus-out1 and
bus-out2). If the XOR gate detects a discrepancy between the outputs of
the two OR gates in module X, its output goes to 1, the flip-flop output
goes to 1 (Q=1), module X is disconnected from the bus and module Y is
connected. A system failure occurs when incorrect data is transmitted
over the system bus.
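The intended logic-level behavior of this system can be sketched as follows. This is a minimal, idealized model and not the authors' circuit: the class name, the per-gate error masks and the assumption that the switch acts instantaneously (with no clocking or analog effects) are all introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class DynamicRedundantSystem:
    """Idealized model: two OR gates per module, XOR detector, J-K latch."""

    q: int = 0  # flip-flop output Q; 0 selects module X, 1 selects module Y

    def step(self, a, b, x_faults=(0, 0), y_faults=(0, 0)):
        """Apply one input pair and return (bus_out1, bus_out2).

        x_faults and y_faults are hypothetical error masks: a 1 inverts
        the output of the corresponding OR gate in that module.
        """
        x_out = [(a | b) ^ f for f in x_faults]
        y_out = [(a | b) ^ f for f in y_faults]
        # Detector: XOR of the two OR-gate outputs in module X.
        if x_out[0] ^ x_out[1]:
            self.q = 1  # error latched; the spare stays connected
        selected = y_out if self.q else x_out
        return tuple(selected)


system = DynamicRedundantSystem()
recovered = system.step(0, 0, x_faults=(1, 0))   # single-gate error in X

blind = DynamicRedundantSystem()
undetected = blind.step(0, 0, x_faults=(1, 1))   # both gates wrong together
```

In this idealized model a single-gate error switches the bus to module Y, whereas an error that affects both OR gates identically leaves the XOR output at 0 and incorrect data stays on the bus.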

Figure  2  shows  the  system  failure  detector.  This  circuit  compares 
the  outputs  of  the  fault-tolerant  system  (bus-out1  and  bus-out2)  to  a
reference  output  (ref-out).  If  the  three  signals  disagree,  the  system 
failure  signal  (sys-fail)  goes  from  1  to  0  (active  low). 
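A logic-level reading of this comparison is sketched below. The exact wiring of Fig. 2 is not reproduced in the text, so the gate-level form (two 3-input ANDs, inverters and an OR) is only an assumption that is consistent with the 74LS11, 74LS04 and 74LS32 parts listed for the peripheral circuitry.

```python
def sys_fail(bus_out1: int, bus_out2: int, ref_out: int) -> int:
    """Active-low failure signal: 1 when all three signals agree, else 0."""
    all_high = bus_out1 & bus_out2 & ref_out
    all_low = (bus_out1 ^ 1) & (bus_out2 ^ 1) & (ref_out ^ 1)
    return all_high | all_low


# Enumerate every input combination that asserts the (active-low) signal.
failing = [
    (a, b, r)
    for a in (0, 1) for b in (0, 1) for r in (0, 1)
    if sys_fail(a, b, r) == 0
]
```

Only the two combinations in which all three signals are equal leave `sys_fail` at 1; the other six are reported as system failures.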

Fig. 2 System Failure Detector

Figure 3 shows the experimental setup. The lower box contains the
actual fault-tolerant system. It has two 74LS32 (quadruple 2-input OR
gates) chips, one for the active module and one for the spare module.
The 74LS86 (quadruple 2-input XOR gates) is used as a detector and the
74LS109 as the switch (J-K flip-flop). A 74LS125A is used for buffering
the system outputs. The upper box contains peripheral circuitry: a
74LS124 (dual voltage-controlled oscillators) is used to generate the
clock signal and a 74LS239 (4-bit binary counter) generates the inputs
for the fault-tolerant system. A 74LS11 (triple 3-input AND gates), a
74LS04 (hex inverters) and a 74LS32 are used to generate the reference
output (ref-out) and compare it to the outputs from the system (bus-out1
and bus-out2). A system failure occurs when bus-out1 or bus-out2 differs
from the reference output. A logic analyzer and an oscilloscope are used
to monitor the behavior of the system.
3  INJECTION  OF  TEMPORARY  FAILURES  INTO  THE  SYSTEM


Experiments to evaluate the efficiency of recovery mechanisms in
fault-tolerant systems have been reported in the literature.
[Decouty 80] describes a tool to evaluate the efficiency of fault
detection mechanisms. This tool generates permanent (s-a-0, s-a-1)
faults, injects them into the system under study and observes the system
behavior. [Crouzet 82] presents the results of an experiment using the
tool described in [Decouty 80]; in this experiment stuck-at faults are
injected into a microcomputer and its behavior is monitored. It is
reported that approximately 99% of the injected faults are detected. In
[Schuette 86], temporary s-a-0(1) faults are injected into a system to
evaluate the coverage of the fault-tolerant schemes used in the system.
In the papers mentioned above, temporary failures were modeled by the
stuck-at fault model. [Cortes 86b] shows that this model is not
appropriate for intermittent failures. Therefore, more realistic fault
injection methods need to be used.

In this experiment, the fault-tolerant system will be subjected to
two types of stress: reduced supply voltage and load. The voltage stress
simulates power supply disturbances that may cause transient failures by
affecting the noise immunity. It is shown in [Cortes 86a] that DC
disturbances and pulsed disturbances have the same effect on chip
behavior. Therefore, it is reasonable to limit the experiments described
in this paper to DC disturbances. The loading stress reduces the drive
capability and simulates leakage paths that could induce intermittent
failures [Cortes 86b]. The system will be divided into three subsystems:
1) the active module, 2) the spare module and 3) the recovery mechanism.
These three subsystems are in the lower box in Fig. 3.

Reducing  the  supply  voltage  of  the  active  (or  spare)  module  is 
accomplished  by  reducing  the  voltage  connected  to  the  Vcc  pin  of  the 
74LS32  containing  the  active  (or  spare)  module.  Reducing  the  voltage 
supply  of  the  recovery  mechanism  is  accomplished  by  tying  the  Vcc  pins 
of  the  detector  (XOR  gate),  switch  (J-K  flip-flop)  and  buffers  to  the 
same  power  supply  and  then  reducing  the  voltage  of  that  power  supply. 

Loading is accomplished by connecting a 1K ohm variable resistance
between a node in the system and ground (or 5 volts). The resistance is
then decreased until an error occurs. The effect of loading is such that
the disturbed lead is not permanently stuck-at-0 or 1. To load the
active (or spare) module, the variable resistance is connected between
the output of one of the OR gates and ground (or 5 volts). To load the
recovery mechanism, the variable resistance is connected between the
output of the XOR gate and ground (or 5 volts).
4  SYSTEM  BEHAVIOR  WITH  SUPPLY  DISTURBANCES

4.1  Supply  disturbances  in  the  active  module 

The system is started with the active module connected to the
system bus. The clock frequency is set at 5 kHz. At low frequency,
errors caused by supply disturbances are due to noise immunity problems
[Cortes 86a]. Figure 4 illustrates the DC noise margin. A gate is
designed to produce an output voltage greater than or equal to VOH(min),
the minimum high output voltage for worst-case output loading.
Similarly, VOL(max) is the maximum low output voltage for worst-case
output loading. For a logic 0 input, the corresponding voltage must be
no more than VIL(max), the maximum low input voltage that guarantees the
appropriate output logic level. For a logic 1 input, the corresponding
voltage must be at least VIH(min). The difference between VOH(min) and
VIH(min) is the high-level signal-line noise margin. Similarly, the
low-level signal-line noise margin is the difference between VIL(max)
and VOL(max).
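The two noise-margin definitions above amount to simple subtractions, worked through in the sketch below. The input limits of 0.8 V and 2 V are the manufacturer values quoted later in this paper; the output limits of 2.7 V and 0.5 V are assumed typical LSTTL data-sheet figures and do not come from the paper.

```python
def noise_margins(voh_min, vol_max, vih_min, vil_max):
    """Return (high-level, low-level) DC noise margins in volts."""
    high_margin = voh_min - vih_min   # VOH(min) - VIH(min)
    low_margin = vil_max - vol_max    # VIL(max) - VOL(max)
    return high_margin, low_margin


# Assumed data-sheet output limits (2.7 V, 0.5 V); input limits as quoted.
nm_high, nm_low = noise_margins(
    voh_min=2.7, vol_max=0.5, vih_min=2.0, vil_max=0.8
)
```

With these assumed limits the high-level margin is about 0.7 V and the low-level margin about 0.3 V. A margin that becomes zero or negative means the driving gate no longer meets the input requirement of the gate it drives.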

The voltage is reduced at the Vcc pin of the 74LS32 containing the
two OR gates of the active module. With a 74LS86 from Vendor A, a system
failure is observed (via the system failure detector and the logic
analyzer) and the recovery mechanism does not disconnect the active
module from the bus. Figure 5 shows the waveforms observed on the
oscilloscope as Vcc is decreased. When Vcc decreases, VOH decreases
while VOL remains constant. The outputs of both OR gates in the active
module behave similarly. When VOH reaches 1.85 V, VOL increases
gradually while VOH remains constant. At VOL = 1.4 V, oscillations are
observed when the output is low, and a system failure occurs. The buffer
outputs are incorrect and the system failure detector (see Fig. 2)
indicates a system failure because of the discrepancy between the
signals on the bus and the reference output (ref-out in Fig. 2). The
oscillations of VOL are interpreted by the buffer as a logic 1 while the
correct output of the OR gates should be 0. During the entire
experiment, the output of the XOR gate remained at 0.2 V. Therefore, the
detector was not able to detect the error and reconfigure the system by
disconnecting the active module and connecting the spare.

Fig. 5 System Behavior with XOR from Vendor A

The  result  of  this  experiment  is  intuitive.  The  power  supply 
disturbance  affected  both  OR  gates  in  the  active  module  thereby 
producing  a  common-mode  failure  that  could  not  be  detected.  In  other 
words,  the  stuck-at  fault  model  could  be  used  to  describe  the  behavior 
of  the  system  with  power  supply  disturbances.  The  next  experiment,
however,  invalidates  this  theory.

The same experiment is repeated with a 74LS86 from Vendor B. When
Vcc for the active module (74LS32) is decreased, the system recovers
successfully, i.e., the active module is disconnected from the bus and
the spare module is connected. Figure 6 shows the waveforms observed on
the oscilloscope. As in the previous experiment, VOH decreases when Vcc
is decreased. Both OR gates in the active module behave similarly. When
VOH reaches 1.85 V, it remains constant and VOL starts increasing until
it reaches 0.9 V. At this point, the output of the XOR gate oscillates.
The amplitude of the oscillations at the output of the XOR gate is
3.2 V. The J-K flip-flop interprets the oscillating signal at the output
of the XOR gate as a logic 1, Q becomes 1 and the recovery is
successful. The result of this experiment is counter-intuitive: the XOR
gate detected the common-mode failure. Consequently, the stuck-at fault
model cannot be used to represent transient failures due to power supply
dips. The system will recover if the 74LS86 produces oscillations with
an amplitude high enough to cause the J-K flip-flop to change states,
thereby disconnecting the active module from the bus and connecting the
spare module. Depending on the electrical characteristics of the
components used in the system, the oscillations at the output of the XOR
gate could have occurred after incorrect data was transmitted over the
bus, thereby producing a system failure. Also, if the amplitude of the
power supply dip were not high enough, the system would not have been
affected at all.

Fig. 6 System Behavior with XOR from Vendor B

In summary, the possible outcomes of the experiment described above
are:

    No system failure (low amplitude power supply dip).

    XOR with high amplitude oscillations:
        Successful recovery before data corruption (Vendor B)
        Data corruption before recovery

    XOR with low amplitude oscillations:
        Data corruption and no recovery (Vendor A)
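The outcome list above can be read as a small decision tree, sketched below. The function name, its parameters and the 2.0 V trigger level are assumptions made for illustration; the paper reports only that a 3.2 V oscillation (Vendor B) switched the J-K flip-flop and a 0.2 V oscillation (Vendor A) did not.

```python
def classify_outcome(system_disturbed: bool, xor_amplitude: float,
                     trigger_level: float, detected_first: bool) -> str:
    """Map one supply-dip experiment onto the outcomes listed in the text.

    trigger_level is a hypothetical amplitude above which the flip-flop
    reads the XOR oscillation as a logic 1 (not a value from the paper).
    """
    if not system_disturbed:
        return "no system failure"
    if xor_amplitude >= trigger_level:
        if detected_first:
            return "successful recovery before data corruption"
        return "data corruption before recovery"
    return "data corruption and no recovery"


TRIGGER = 2.0  # assumed threshold in volts, for illustration only
vendor_b = classify_outcome(True, 3.2, TRIGGER, detected_first=True)
vendor_a = classify_outcome(True, 0.2, TRIGGER, detected_first=False)
```

The sketch makes explicit that recovery depends on two independent conditions: the oscillation amplitude at the XOR output and whether detection happens before incorrect data reaches the bus.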


Decreasing Vcc of the active module is a good test to determine
whether or not the XOR gate can detect a transient failure due to power
supply fluctuation. The experiment reported in this section was repeated
with XOR chips from two more vendors; in both cases, the XOR gates
produced oscillations that caused the J-K flip-flop to change states and
the system recovered successfully. In summary, XOR gates with high
amplitude oscillations will be able to detect the "common-mode" failure.
The 74LS86 from Vendor A produced oscillations of amplitude 0.2 V. This
amplitude was not high enough to cause the J-K flip-flop to change
states.

A  system  using  XORs  from  vendor  A  will  be  able  to  recover  from 
transient  failures  due  to  power  supply  fluctuations  if  the  two  OR  gates 
in  the  active  module  are  on  different  chips.  The  experiment  described 
above  was  repeated  with  a  74LS86  from  vendor  A  and  the  two  OR  gates  in
the  active  module  on  different  74LS32s.  Vcc  of  one  of  the  74LS32s  was
decreased  and  the  system  recovered  successfully  by  disconnecting  the 
active  module  from  the  bus  and  connecting  the  spare  module. 

4.2  Supply  disturbances  in  the  recovery  mechanism 

The Vcc pins of the chips containing the recovery mechanism
(74LS86, 74LS109 and 74LS125A) are tied to the same power supply. With
the active module connected to the system bus, Vcc for these three chips
is lowered. Figure 7 shows the waveforms observed on the oscilloscope.
A system failure occurs followed by a recovery (disconnection of the
active module and connection of the spare module). When the system
failure occurred, the Q and Q' outputs of the flip-flop were switching
between 0 and 1. When Q and Q' were both 1, all four buffers were in the
high impedance state, both modules (X and Y) were disconnected from the
bus, and the system had floating outputs and consequently failed because
of data corruption. Eventually, Q settled at 1 and Q' at 0 but the
recovery was too late.

Fig. 7 Waveforms with Supply Disturbances in Recovery Mechanism

    Q Q' = 0 1   Active connected to bus
    Q Q' = 1 0   Spare connected to bus
    Q Q' = 1 1   Active and spare disconnected from bus (system failure)

In  summary,  the  power  supply  dip  caused  the  recovery  mechanism  to 
disconnect  the  active  module  from  the  bus  and  connect  the  spare.  During 
the  recovery,  incorrect  data  was  transmitted  over  the  bus  thereby 
producing  a  system  failure. 
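The relation between the flip-flop outputs and the bus connection described in this section can be written as a lookup, sketched below. The function names and the sample sequence are hypothetical; the Q = Q' = 0 combination is not described in the text and is deliberately left unclassified.

```python
def bus_state(q: int, q_bar: int) -> str:
    """Decode the flip-flop outputs (Q, Q') into the bus connection."""
    states = {
        (0, 1): "active connected to bus",
        (1, 0): "spare connected to bus",
        (1, 1): "active and spare disconnected from bus (system failure)",
    }
    return states.get((q, q_bar), "not described in the source")


def first_floating_sample(samples):
    """Return the index of the first (Q, Q') sample equal to (1, 1), or None."""
    for index, pair in enumerate(samples):
        if tuple(pair) == (1, 1):
            return index
    return None


# Hypothetical sample sequence: normal operation, a glitch, then recovery.
trace = [(0, 1), (0, 1), (1, 1), (1, 0), (1, 0)]
glitch_at = first_floating_sample(trace)
```

In this made-up trace the bus floats at the third sample, before Q and Q' settle with the spare connected, which mirrors the "recovery was too late" observation.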


5  SYSTEM  BEHAVIOR  WITH  INTERMITTENT  FAILURES 


5.1  Intermittent  failures  in  the  active  module

A 1K ohm variable resistance is connected between the output of one
of the OR gates in the active module and ground. With the active module
connected to the system bus, the resistance is decreased until a system
failure is observed. The loading stress reduces the drive capability and
simulates leakage paths that could induce intermittent failures
[Cortes 86a]. After the system failure, the J-K flip-flop changes state
and the active module is disconnected from the system bus. The outputs
of the two buffers were different before the XOR gate sensed the
discrepancy and detected the error. The decreasing resistance pulls down
the node it is connected to, so the high output voltage (VOH) at that
node decreases. Because VIL for the buffers is higher than VIL for the
XOR gate, the buffers go to 0 while the XOR gate still interprets the
decreased voltage at the pulled-down node as a logic 1.

VIL for the buffers and the XOR gate was determined experimentally.
The buffer input was tied to a variable voltage source. Starting from
5 volts, the input voltage was decreased until the output of the buffer
switched to the low output voltage. The input voltage at which the
switching occurred is VIL. For the XOR gate, one of the inputs was
connected to the variable voltage source while the other input was
connected to: 1) logic 0, 2) logic 1. The logic 0 (logic 1) was obtained
at the output of a buffer whose input was tied to ground (Vcc). Starting
from 5 volts, the variable voltage was decreased until the output of the
XOR gate switched. The input voltage at which the switching occurred is
VIL. The values of VIL and VIH specified by the manufacturer are 0.8 V
and 2 V respectively. These values are more conservative than the ones
determined experimentally to take into account variations in the
manufacturing process. Table 1 shows the results of the experiment.

Table 1 VIL and VIH for the XOR gate and the buffers
The same experiment is repeated with the resistance connected
between the OR gate output and the power supply (5 volts) instead of
ground. The resistance is decreased. The recovery mechanism detects the
error and disconnects the active module from the system bus. The
decreasing resistance pulls up the node it is connected to, so the
logic 0 voltage level (VOL) at that node increases. Because VIH for the
XOR gate is lower than VIH for the buffers, the XOR gate interprets the
voltage at the loaded node as a logic 1 before the buffer does. The
flip-flop changes state before the buffer produces incorrect data and
the recovery is successful. VIH for the buffers and the XOR gate was
determined experimentally using the same setup described in the previous
paragraph. The results are shown in Table 1. In summary, if:

    VIL(XOR) > VIL(Buffer) and
    VIH(XOR) < VIH(Buffer)

the XOR chip will detect intermittent failures. Furthermore, the
results of the experiments described above show that the stuck-at fault
model is not appropriate in the case of intermittent failures. An
"intermittent s-a-0" fault at the output of one of the OR gates in the
active module should be detected but is not.
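The two threshold conditions can be checked mechanically, and the pull-down case can be replayed as a voltage sweep, as in the sketch below. The function names and all threshold values are hypothetical; Table 1 is not reproduced here, so none of the numbers are measured data.

```python
def xor_detects_first(vil_xor, vih_xor, vil_buf, vih_buf) -> bool:
    """True when both detection conditions from the text hold."""
    return vil_xor > vil_buf and vih_xor < vih_buf


def first_to_see_low(voltages, vil_xor, vil_buf) -> str:
    """Walk a falling node voltage and report which input reads 0 first.

    A device is taken to read logic 0 once the voltage is at or below its
    experimentally determined VIL (the switching point on a falling input).
    """
    for v in voltages:
        xor_low = v <= vil_xor
        buf_low = v <= vil_buf
        if xor_low and buf_low:
            return "both"
        if xor_low:
            return "xor"
        if buf_low:
            return "buffer"
    return "neither"


# Falling sweep from 5.0 V to 0.0 V in 0.1 V steps.
sweep = [step / 10 for step in range(50, -1, -1)]

# Hypothetical thresholds: the buffer switches at a higher voltage than
# the XOR gate, which is the failing pull-down case described in 5.1.
failing_case = first_to_see_low(sweep, vil_xor=1.0, vil_buf=1.2)
passing_case = first_to_see_low(sweep, vil_xor=1.2, vil_buf=1.0)
```

When the buffer reads 0 first, incorrect data reaches the bus before the detector reacts; when the XOR gate reads 0 first, the discrepancy is seen before the buffer output changes.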

5.2  Intermittent  failures  in  the  recovery  mechanism

A  variable  resistance  is  connected  between  the  output  of  the  XOR 
gate  and  the  power  supply  (5  volts).  The  resistance  is  gradually 
decreased  until  the  recovery  mechanism  switches  off  the  active  module 
and  connects  the  spare  module  to  the  system  bus.  The  recovery  is 
successful  and  there  is  no  incorrect  data  on  the  system  bus  before  or 
during  the  changing  of  state  of  the  J-K  flip-flop.  It  is  not  necessary 
to  conduct  the  same  experiment  with  the  variable  resistance  connected  to 
ground  instead  of  the  power  supply.  The  output  of  the  XOR  gate  being  0 
in  the  error-free  condition,  pulling  it  down  to  0  will  not  have  any 
effect  unless  a  failure  occurs  in  the  active  module;  in  this  case,  the 
XOR  gate  will  be  unable  to  produce  a  1  at  the  output,  the  J-K  flip-flop
will  not  change  states  and  the  system  will  fail. 


6  SUMMARY  AND  CONCLUSIONS


Realistic  temporary  failures  are  injected  into  a  simple  system  that 
uses  dynamic  redundancy  for  fault-tolerance.  It  is  shown  that  the 
system  does  not  recover  from  certain  types  of  temporary  failures  and 
that  the  stuck-at  fault  model  is  inappropriate  for  these  temporary 
failures.  Transient  failures  are  induced  by  decreasing  the  power  supply 
voltage.  Intermittent  failures  are  induced  by  loading  nodes  in  the 
system  (to  ground  or  Vcc).  Decreasing  the  supply  voltage  of  the  active
module  causes  common-mode  transient  failures  that  may  not  be  detected  by
the  recovery  mechanism.  Intermittent  failures  in  the  active  module  may 
or  may  not  be  detected  depending  on  the  particular  electrical 
characteristics  of  the  components  used  in  the  system.  Temporary 
failures  in  the  recovery  mechanism  are  also  studied  and  it  is  shown  that 
some  of  them  produce  a  system  failure. 

Tests are recommended for XOR chips to guarantee detection of
temporary failures due to power supply fluctuations, and bounds are
derived for VIL and VIH of the XOR and the buffers to guarantee
detection of intermittent failures.

On-going  research  shows  that  the  same  system  built  with  CMOS 
catalog  parts  exhibits  similar  behavior  in  the  presence  of  temporary 
failures.  More  research  needs  to  be  done  to  evaluate  the  efficiency  of
other  implementations  of  the  recovery  mechanism  as  well  as  other  fault- 